Abstract

We propose a Bayesian approach for recursively estimating the classifier weights in online 
learning of a classifier ensemble. In contrast with past methods, such as stochastic gradient 
descent or online boosting, our approach estimates the weights by recursively updating their
posterior distribution. For a specified class of loss functions, we show that it is possible to 
formulate a suitably defined likelihood function and hence use the posterior distribution as 
an approximation to the global empirical loss minimizer. If the stream of training data is 
sampled from a stationary process, we can also show that our approach admits a faster
rate of convergence to the expected loss minimizer than is possible with standard stochastic
gradient descent. In experiments with real-world datasets, our formulation often performs 
better than state-of-the-art stochastic gradient descent and online boosting algorithms. 
Keywords: Online learning, classifier ensembles, Bayesian methods. 


1. Introduction 


The basic idea of classifier ensembles is to enhance the performance of individual classifiers 
by combining them. In the offline setting, a popular approach to obtain the ensemble weights 
is to minimize the training error, or a surrogate risk function that approximates the training 
error. Solving this optimization problem usually calls for various sorts of gradient descent 
methods. For example, the most successful and popular ensemble technique, boosting, can
be viewed in such a way (Freund and Schapire, 1995; Mason et al., 1999; Friedman, 2001;
Telgarsky, 2012). Given the success of these ensemble techniques in a variety of batch
learning tasks, it is natural to consider extending this idea to the online setting, where the
labeled sample pairs $\{\mathbf{x}_t, y_t\}_{t=1}^T$ are presented to and processed by the algorithm sequentially,
one at a time. 

Indeed, online versions of ensemble methods have been proposed from a spectrum 
of perspectives. Some of these works focus on close approximation of offline ensemble
schemes, such as boosting (Oza and Russell, 2001; Pelossof et al., 2009). Other methods
are based on stochastic gradient descent (Babenko et al., 2009b; Leistner et al., 2009;
Grbovic and Vucetic, 2011). Recently, Chen et al. (2012) formulated a smoothed boosting
algorithm based on the analysis of regret from offline benchmarks. Despite their success in
many applications (Grabner and Bischof, 2006; Babenko et al., 2009a), however, there are
some common drawbacks of these online ensemble methods, including the lack of a universal
framework for theoretical analysis and comparison, and the ad hoc tuning of learning
parameters such as step size.

In this work, we propose an online ensemble classification method that is not based 
on boosting or gradient descent. The main idea is to recursively estimate a posterior 
distribution of the ensemble weights in a Bayesian manner. We show that, for a given class 
of loss functions, we can define a likelihood function on the ensemble weights and, with an 
appropriately formulated prior distribution, we can generate a posterior mean that closely 
approximates the empirical loss minimizer. If the stream of training data is sampled from 
a stationary process, this posterior mean converges to the expected loss minimizer. 

Let us briefly explain the rationale for this scheme, which should be contrasted with the
usual Bayesian setup where the likelihood is chosen to describe closely the generating process
of the training data. In our framework, we view Bayesian updating as a loss minimization
procedure: it provides an approximation to the minimizer of a well-defined risk function.
More precisely, this risk minimization interpretation comes from the exploitation of two
results in statistical asymptotic theory. First, under mild regularity conditions, a
Bayesian posterior distribution tends to peak at the maximum likelihood estimate (MLE)
of the same likelihood function, as a consequence of the so-called Laplace method (MacKay,
2003). Second, the MLE can be viewed as a risk minimizer, where the risk is defined precisely
as the expected negative log-likelihood. Therefore, given a user-defined loss function, one
can choose a suitable log-likelihood as a pure artifact, and apply a corresponding Bayesian 
update to minimize the risk. We will develop the theoretical foundation that justifies the 
above rationale. 

Our proposed online ensemble classifier learning scheme is straightforward, but powerful 
in two respects. First, whenever our scheme is applicable, it can approximate the global 
optimal solution, in contrast with local methods such as stochastic gradient descent (SGD). 
Second, assuming the training data is sampled from a stationary process, our proposed 
scheme possesses a rate of convergence to the expected loss minimizer that is at least 
as fast as standard SGD. In fact, our rate is faster unless the SGD step size is chosen 
optimally, which cannot be done a priori in the online setting. Furthermore, we also found
that our method performs better in experiments with finite datasets compared with the
averaging schemes in SGD (Polyak and Juditsky, 1992; Schmidt et al., 2013) that have the
same optimal theoretical convergence rate as our method.

In addition to providing a theoretical analysis of our formulation, we also tested our approach
on real-world datasets and compared with individual classifiers, a baseline stochastic
gradient descent method for learning classifier ensembles, and their averaging variants, as 
well as state-of-the-art online boosting methods. We found that our scheme consistently 
achieves superior performance over the baselines and often performs better than state-of- 
the-art online boosting algorithms, further demonstrating the validity of our theoretical 
analysis. 


A Bayesian Approach for Online Classifier Ensemble 


In summary, our contributions are: 

1. We propose a Bayesian approach to estimate the classifier weights with closed-form 
updates for online learning of classifier ensembles. 

2. We provide theoretical analyses of both the convergence guarantee and the bound on 
prediction error. 

3. We compare the asymptotic convergence rate of the proposed framework versus previous
gradient descent frameworks, thereby demonstrating the advantage of the proposed
framework.

This paper is organized as follows. We first briefly discuss the related works. We then
state in detail our approach and provide theoretical guarantees in Section 3. A specific
example for solving the online ensemble problem is provided in Section 4 and numerical
experiments are reported in Section 5. We discuss the use of other loss functions for online
ensemble learning in Section 6 and conclude our paper in Section 7 with future work. Some
technical proofs are left to the Appendix.


2. Related work 


There is considerable past work on online ensemble learning. Many past works have focused
on online learning with concept drift (Wang et al., 2003; Kolter and Maloof, 2005, 2007;
Minku, 2011), where dynamic strategies of pruning and rebuilding ensemble members are
usually considered. Given the technical difficulty, theoretical analysis for concept drift seems
to be underdeveloped. Kolter and Maloof (2005) proved error bounds for their proposed
method, which appears to be the first such theoretical analysis, yet such analysis is not
easily generalized to other methods in this category. Other works, such as Schapire (2001)
and Cesa-Bianchi and Lugosi (2003), obtained performance bounds from the perspective of
iterative games.

Our work is more closely related to methods that operate in a stationary environment,
most notably some online boosting methods. One of the first methods was proposed
by Oza and Russell (2001), who showed asymptotic convergence to batch boosting under
certain conditions. However, the convergence result only holds for some simple “lossless”
weak classifiers (Oza, 2001), such as Naive Bayes. Other variants of online boosting have
been proposed, such as methods that employ feature selection (Grabner and Bischof, 2006;
Liu and Yu, 2007), semi-supervised learning (Grabner et al., 2008), multiple instance
learning (Babenko et al., 2009a), and multi-class learning (Saffari et al., 2010). However, most
of these works consider the design and update of weak classifiers beyond that of Oza (2001)
and, thus, do not bear the convergence guarantee therein. Other methods employ the
gradient descent framework, such as Online GradientBoost (Leistner et al., 2009), Online
Stochastic Boosting (Babenko et al., 2009b) and Incremental Boosting (Grbovic and Vucetic,
2011). There are convergence results given for many of these, which provide a basis for
comparison with our framework. In fact, we show that our method compares favorably to
gradient descent in terms of asymptotic convergence rate.

Other recent online boosting methods (Chen et al., 2012; Beygelzimer et al., 2015) generalize
the weak learning assumption to online learning, and can offer theoretical guarantees
on the error rate of the learned strong classifier if certain performance assumptions are satisfied
for the weak learners. Our work differs from these approaches, in that our formulation
and theoretical analysis focus on the classes of loss functions, rather than imposing assumptions
on the set of weak learners. In particular, we show that the ensemble weights in
our algorithm converge asymptotically at an optimal rate to the minimizer of the expected
loss.

Our proposed optimization scheme is related to two other lines of work. First is the so-called
model-based method for global optimization (Zlochin et al., 2004; Rubinstein and Kroese,
2004; Hu et al., 2007). This method iteratively generates an approximately optimal solution
as the summary statistic for an evolving probability distribution. It is primarily designed
for deterministic optimization, in contrast to the stochastic optimization setting that we
consider. Second, our approach is, at least superficially, related to Bayesian model averaging
(BMA) (Hoeting et al., 1999). While BMA is motivated from a model selection
viewpoint and aims to combine several candidate models for better description of the data,
our approach does not impose any model but instead targets loss minimization.

The present work builds on an earlier conference paper (Bai et al., 2014). We make
several generalizations here. First, we remove a restrictive, non-standard requirement on
the loss function (which enforces the loss function to satisfy a certain integral equality;
Assumption 2 in Bai et al., 2014). Second, we conduct experiments that compare our
formulation with two variants of the SGD baseline in Bai et al. (2014), where the ensemble
weights are estimated via two averaging schemes of SGD, namely Polyak-Juditsky averaging
(Polyak and Juditsky, 1992) and Stochastic Averaging Gradient (Schmidt et al., 2013).
Third, we evaluate two additional loss functions for ensemble learning and compare them
with the loss function proposed in Bai et al. (2014).


3. Bayesian Recursive Ensemble 

We denote the input feature by $\mathbf{x}$ and its classification label by $y$ (1 or $-1$). We assume that
we are given $m$ binary weak classifiers $\{c_i(\mathbf{x})\}_{i=1}^m$, and our goal is to find the best ensemble
weights $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_m)$, where $\lambda_i > 0$, to construct an ensemble classifier. For now, we do
not impose a particular form of ensemble method (we defer this until Section 4), although
one example form is $\sum_i \lambda_i c_i(\mathbf{x})$. We focus on online learning, where training data $(\mathbf{x}, y)$
comes in sequentially, one at a time at $t = 1, 2, 3, \ldots$.

3.1 Loss Minimization Formulation 

We formulate the online ensemble learning problem as a stochastic loss minimization problem.
We first introduce a loss function at the weak classifier level. Given a training pair
$(\mathbf{x}, y)$ and an arbitrary weak classifier $h$, we denote $g := g(h(\mathbf{x}), y)$ as a non-negative loss
function. Popular choices of $g$ include the logistic loss function, hinge loss, ramp loss,
zero-one loss, etc. If $h$ is one of the given weak classifiers $c_i$, we will denote $g(c_i(\mathbf{x}), y)$ as
$g_i(\mathbf{x}, y)$, or simply $g_i$ for ease of notation. Furthermore, we define $g_i^t := g(c_i^t(\mathbf{x}^t), y^t)$ where
$(\mathbf{x}^t, y^t)$ is the training sample and $c_i^t$ the updated $i$-th weak classifier at time $t$. To simplify
notation, we use $\mathbf{g} := (g_1, \ldots, g_m)$ to denote the vector of losses for the weak classifiers,
$\mathbf{g}^t := (g_1^t, \ldots, g_m^t)$ to denote the losses at time $t$, and $\mathbf{g}^{1:T} := (\mathbf{g}^1, \ldots, \mathbf{g}^T)$ to denote the
losses up to time $T$.
With the above notation, we let $\ell(\boldsymbol{\lambda}; \mathbf{g}^t)$ be some ensemble loss function at time $t$,
which depends on the ensemble weights and the individual loss of each weak classifier.
Then, ideally, the optimal ensemble weight vector $\boldsymbol{\lambda}^*$ should minimize the expected loss
$E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$, where the expectation is taken with respect to the underlying distribution of the
training data $p(\mathbf{x}, y)$. Since this data distribution is unknown, we use the empirical loss as
a surrogate:

$$L_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T}) = \ell_0(\boldsymbol{\lambda}) + \sum_{t=1}^{T} \ell(\boldsymbol{\lambda}; \mathbf{g}^t) \tag{1}$$

where $\ell_0(\boldsymbol{\lambda})$ can be regarded as an initial loss and can be omitted.
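
As an illustrative sketch of the notation above (not code from the paper), the following Python snippet evaluates a few common weak-classifier losses on a real-valued score and accumulates the empirical loss (1). The function names, the margin-based form of the losses, the particular ramp-loss convention, and the toy weighted-sum ensemble loss used at the end are all assumptions made for illustration.

```python
import math

def logistic_loss(score, y):
    """log(1 + exp(-y * score)) for a real-valued score and y in {+1, -1}."""
    return math.log1p(math.exp(-y * score))

def hinge_loss(score, y):
    return max(0.0, 1.0 - y * score)

def ramp_loss(score, y):
    # One common convention (an assumption here): hinge loss clipped at 1.
    return min(1.0, hinge_loss(score, y))

def zero_one_loss(score, y):
    return 0.0 if y * score > 0 else 1.0

def loss_vector(g, scores, y):
    """g^t = (g_1^t, ..., g_m^t): the loss of each weak classifier on one sample."""
    return [g(s, y) for s in scores]

def empirical_loss(lam, loss_vectors, ell, ell0=lambda lam: 0.0):
    """L_T(lam; g^{1:T}) = ell0(lam) + sum_t ell(lam; g^t), as in (1)."""
    return ell0(lam) + sum(ell(lam, g_t) for g_t in loss_vectors)

# Toy usage with a hypothetical ensemble loss: a weighted sum of the weak losses.
weighted_sum = lambda lam, g_t: sum(l * gi for l, gi in zip(lam, g_t))
g1 = loss_vector(hinge_loss, [0.5, -1.0], +1)
g2 = loss_vector(hinge_loss, [2.0, 0.0], -1)
total = empirical_loss([1.0, 0.5], [g1, g2], weighted_sum)
```

Passing the ensemble loss and the initial loss as callables keeps the sketch independent of any particular choice of $\ell$ and $\ell_0$.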

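We make a set of assumptions on $L_T$ that are adapted from Chen (1985):


Assumption 1 (Regularity Conditions) Assume that for each $T$, there exists a $\boldsymbol{\lambda}_T^*$ that
minimizes (1), and

1. “local optimality”: for each $T$, $\nabla L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T}) = 0$ and $\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})$ is positive
definite,

2. “steepness”: the minimum eigenvalue of $\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})$ approaches $\infty$ as $T \to \infty$,

3. “smoothness”: for any $\epsilon > 0$, there exists a positive integer $N$ and $\delta > 0$ such that
for any $T > N$ and $\boldsymbol{\theta} \in H_\delta(\boldsymbol{\lambda}_T^*) = \{\boldsymbol{\theta} : \|\boldsymbol{\theta} - \boldsymbol{\lambda}_T^*\|_2 \le \delta\}$, $\nabla^2 L_T(\boldsymbol{\theta}; \mathbf{g}^{1:T})$ exists and
satisfies

$$I - A(\epsilon) \le \nabla^2 L_T(\boldsymbol{\theta}; \mathbf{g}^{1:T}) \left(\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})\right)^{-1} \le I + A(\epsilon)$$

for some positive semidefinite symmetric matrix $A(\epsilon)$ whose largest eigenvalue tends
to 0 as $\epsilon \to 0$, and the inequalities above are matrix inequalities,

4. “concentration”: for any $\delta > 0$, there exists a positive integer $N$ and constants $c, p > 0$
such that for any $T > N$ and $\boldsymbol{\theta} \notin H_\delta(\boldsymbol{\lambda}_T^*)$, we have

$$L_T(\boldsymbol{\theta}; \mathbf{g}^{1:T}) - L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T}) > c \left( (\boldsymbol{\theta} - \boldsymbol{\lambda}_T^*)' \, \nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T}) \, (\boldsymbol{\theta} - \boldsymbol{\lambda}_T^*) \right)^p,$$

5. “integrability”:

$$\int e^{-L_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T})} \, d\boldsymbol{\lambda} < \infty.$$


In the situation where $\ell$ is separable in terms of each component of $\boldsymbol{\lambda}$, i.e. $\ell(\boldsymbol{\lambda}; \mathbf{g}) =
\sum_{i=1}^m r_i(\lambda_i; \mathbf{g})$ and $\ell_0(\boldsymbol{\lambda}) = \sum_{i=1}^m s_i(\lambda_i)$ for some twice differentiable functions $r_i(\cdot; \mathbf{g})$ and
$s_i(\cdot)$, the assumptions above will depend only on $f_i(\lambda; \mathbf{g}^{1:T}) := \sum_{t=1}^T r_i(\lambda; \mathbf{g}^t) + s_i(\lambda)$ for
each $i$. For example, Condition 3 in Assumption 1 reduces to merely checking uniform
continuity of each $f_i''(\cdot; \mathbf{g}^{1:T})$.

Condition 1 in Assumption 1 can be interpreted as the standard first and second order
conditions for the optimality of $\boldsymbol{\lambda}_T^*$, whereas Condition 3 in essence requires continuity of
the Hessian matrix. Conditions 2 and 4 are needed for the use of the Laplace method
(MacKay, 2003), which, as we will show later, stipulates that the posterior distribution
peaks near the optimal solution $\boldsymbol{\lambda}_T^*$ of the empirical loss (1).

3.2 A Bayesian Approach

We state our procedure in Algorithm 1. We define $p(\mathbf{g}|\boldsymbol{\lambda}) = e^{-\ell(\boldsymbol{\lambda}; \mathbf{g})}$ and $p(\boldsymbol{\lambda}) = e^{-\ell_0(\boldsymbol{\lambda})}$.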


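Algorithm 1 Bayesian Ensemble

Input: streaming samples $\{(\mathbf{x}^t, y^t)\}_{t=1}^T$
       online weak classifiers $\{c_i^t(\mathbf{x})\}_{i=1}^m$
       the functions $p(\mathbf{g}|\boldsymbol{\lambda})$ and $p(\boldsymbol{\lambda})$

Initialize: hyper-parameters for $p(\mathbf{g}|\boldsymbol{\lambda})$ and $p(\boldsymbol{\lambda})$
for t = 1 to T do
    $\forall i$, compute $g_i^t = g(c_i^t(\mathbf{x}^t), y^t)$
    update for the “posterior distribution” of $\boldsymbol{\lambda}$:
        $p(\boldsymbol{\lambda}|\mathbf{g}^{1:t}) \propto p(\mathbf{g}^t|\boldsymbol{\lambda}) \, p(\boldsymbol{\lambda}|\mathbf{g}^{1:t-1}) \propto \prod_{s=1}^{t} p(\mathbf{g}^s|\boldsymbol{\lambda}) \, p(\boldsymbol{\lambda})$
    update the weak classifiers using $(\mathbf{x}^t, y^t)$
end for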


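Algorithm 1 requires some further explanation:

1. Our updated estimate for $\boldsymbol{\lambda}$ at each step is the “posterior mean” for $\boldsymbol{\lambda}$, given by

$$\frac{\int \boldsymbol{\lambda} \prod_{s=1}^{t} p(\mathbf{g}^s|\boldsymbol{\lambda}) \, p(\boldsymbol{\lambda}) \, d\boldsymbol{\lambda}}{\int \prod_{s=1}^{t} p(\mathbf{g}^s|\boldsymbol{\lambda}) \, p(\boldsymbol{\lambda}) \, d\boldsymbol{\lambda}}$$

2. When the loss function $\ell$ satisfies

$$\int e^{-\ell(\boldsymbol{\lambda}; \mathbf{w})} \, d\mathbf{w} = 1 \tag{2}$$

and $\ell_0$ satisfies

$$\int e^{-\ell_0(\mathbf{w})} \, d\mathbf{w} = 1$$

then $p(\mathbf{g}|\boldsymbol{\lambda})$ is a valid likelihood function and $p(\boldsymbol{\lambda})$ a valid prior distribution, so that
$p(\boldsymbol{\lambda}|\mathbf{g}^{1:t})$ as depicted in Algorithm 1 is indeed a posterior distribution for $\boldsymbol{\lambda}$ (i.e. the
quotation marks around “posterior distribution” in the algorithm can be removed).
In this context, a good choice of $p(\boldsymbol{\lambda}) = e^{-\ell_0(\boldsymbol{\lambda})}$, e.g. as a conjugate prior for the
likelihood $p(\mathbf{g}|\boldsymbol{\lambda}) = e^{-\ell(\boldsymbol{\lambda}; \mathbf{g})}$, can greatly reduce the computational effort at each
step. On the other hand, we also mention that such a likelihood interpretation is not
a necessary requirement for Algorithm 1 to work, since its convergence analysis relies
on the Laplace method, which is non-probabilistic in nature.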
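
To make the recursion in Algorithm 1 concrete, here is a minimal sketch that runs the update on a one-dimensional grid of candidate weights, working in log space. It is not the paper's implementation: the grid approximation, the function name `recursive_posterior_means`, and the scalar loss $\theta \lambda g - \log \lambda$ with initial loss $\ell_0(\lambda) = \lambda$ (the same form as the example in Section 4) are assumptions chosen so that the result can be compared with a known Gamma posterior mean.

```python
import math

def recursive_posterior_means(loss_stream, ell, ell0, grid):
    """Grid version of p(lam | g^{1:t}) ∝ p(g^t | lam) p(lam | g^{1:t-1}).

    Uses log p(g | lam) = -ell(lam, g) and log p(lam) = -ell0(lam);
    returns the (approximate) posterior mean after every step.
    """
    log_post = [-ell0(lam) for lam in grid]          # log prior, up to a constant
    means = []
    for g_t in loss_stream:
        # Multiply by the newest "likelihood" term (addition in log space).
        log_post = [lp - ell(lam, g_t) for lp, lam in zip(log_post, grid)]
        shift = max(log_post)                        # guards against underflow
        w = [math.exp(lp - shift) for lp in log_post]
        means.append(sum(wi * lam for wi, lam in zip(w, grid)) / sum(w))
    return means

theta = 0.1
ell = lambda lam, g: theta * lam * g - math.log(lam)   # illustrative scalar loss
ell0 = lambda lam: lam                                  # illustrative initial loss
grid = [0.01 * k for k in range(1, 4001)]               # lambda in (0, 40]
stream = [0.5 + 0.25 * (t % 5) for t in range(20)]      # synthetic losses g^t
means = recursive_posterior_means(stream, ell, ell0, grid)
```

For this particular loss the posterior after $t$ steps is a Gamma distribution with mean $(1 + t)/(1 + \theta \sum_{s \le t} g^s)$, so the grid estimate can be checked against it. A conjugate choice removes the need for any grid.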


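Algorithm 1 offers the desirable properties characterized by the following theorem.

Theorem 1 Under Assumption 1, the sequence of random vectors $\boldsymbol{\lambda}_T$ with distributions
$p_T(\boldsymbol{\lambda}|\mathbf{g}^{1:T})$ in Algorithm 1 satisfies the asymptotic normality property

$$\left(\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})\right)^{1/2} (\boldsymbol{\lambda}_T - \boldsymbol{\lambda}_T^*) \xrightarrow{d} N(0, I) \tag{3}$$

where $\boldsymbol{\lambda}_T$ is interpreted as a random variable with distribution $p_T(\boldsymbol{\lambda}|\mathbf{g}^{1:T})$, and $\xrightarrow{d}$
denotes convergence in distribution. Furthermore, under the uniform integrability condition
$\sup_T E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}} \|\boldsymbol{\lambda}_T - \boldsymbol{\lambda}_T^*\|_1^{1+\epsilon} < \infty$ for some $\epsilon > 0$, we have

$$\left| E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}_T^* \right| = o\!\left(\frac{1}{\sigma_T^{1/2}}\right) \tag{4}$$

where $E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\cdot]$ denotes the posterior mean and $\sigma_T$ is the minimum eigenvalue of the
matrix $\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})$.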


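Proof Let

$$\tilde{L}_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T}) = L_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T}) + \log \int e^{-L_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T})} \, d\boldsymbol{\lambda}$$

which is well-defined by Condition 5 in Assumption 1. Note that $e^{-\tilde{L}_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T})}$ is a valid
probability density in $\boldsymbol{\lambda}$ by definition. Moreover, Conditions 1-4 in Assumption 1 all hold
when $L_T$ is replaced by $\tilde{L}_T$ (since they all depend only on the gradient of $L_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T})$ with
respect to $\boldsymbol{\lambda}$ or the difference $L_T(\boldsymbol{\lambda}_1; \mathbf{g}^{1:T}) - L_T(\boldsymbol{\lambda}_2; \mathbf{g}^{1:T})$).

The convergence in (3) then follows from Theorem 2.1 in Chen (1985) applied to the
sequence of densities $e^{-\tilde{L}_T(\boldsymbol{\lambda}; \mathbf{g}^{1:T})}$ for $T = 1, 2, \ldots$. Condition 1 in Assumption 1 is equivalent
to conditions (P1) and (P2) therein, while Conditions 2 and 3 in Assumption 1 correspond
to (C1) and (C2) in Chen (1985). Condition 4 is equivalent to (C3.1), which then implies
(C3) there to invoke its Theorem 2.1 to conclude (3).

To show the bound (4), we take the expectation on (3) to get

$$\left(\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})\right)^{1/2} \left(E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}_T^*\right) \to 0, \tag{5}$$

which is valid because of the uniform integrability condition $\sup_T E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}} \|\boldsymbol{\lambda}_T - \boldsymbol{\lambda}_T^*\|_1^{1+\epsilon} < \infty$
(Durrett, 2010). Therefore, $E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}_T^* = \left(\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})\right)^{-1/2} \mathbf{w}_T$ where $\mathbf{w}_T = o(1)$
by (5). But then

$$\left\| E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}_T^* \right\|_1 \le \left\| \left(\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T})\right)^{-1/2} \right\|_1 \|\mathbf{w}_T\|_1 \le \frac{C}{\sigma_T^{1/2}} \|\mathbf{w}_T\|_1 = o\!\left(\frac{1}{\sigma_T^{1/2}}\right)$$

for some constant $C$, where $\|\cdot\|_1$ when applied to a matrix is the induced $L_1$-norm. This shows (4). ■

The idea behind (3) comes from classical Bayesian asymptotics and is an application
of the so-called Laplace method (MacKay, 2003). Theorem 1 states that given the loss
structure satisfying Assumption 1, the posterior distribution of $\boldsymbol{\lambda}$ under our update scheme
provides an approximation to the minimizer $\boldsymbol{\lambda}_T^*$ of the cumulative loss at time $T$, as $T$
increases, by tending to a normal distribution peaked at $\boldsymbol{\lambda}_T^*$ with shrinking variance ($\boldsymbol{\lambda}_T^*$
here can be interpreted as the maximum a posteriori (MAP) estimate). The bound (4)
states that this posterior distribution can be summarized using the posterior mean to give
a point estimate of $\boldsymbol{\lambda}_T^*$. Moreover, note that $\boldsymbol{\lambda}_T^*$ is the global, not merely local, minimizer
of the cumulative loss. This approximation of the global optimum highlights a key advantage
of the proposed Bayesian scheme over other methods such as stochastic gradient descent
(SGD), which only find a local optimum.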

The next theorem states another benefit of our Bayesian scheme over standard SGD. 
Suppose that SGD does indeed converge to the global optimum. Even so, it turns out that 
our proposed Bayesian scheme converges faster than standard SGD under the assumption 
of i.i.d. training samples. 

Theorem 2 Suppose Assumption 1 holds. Assume also that $\mathbf{g}^t$ are i.i.d., with $E[\ell(\boldsymbol{\lambda}; \mathbf{g})] <
\infty$ and $E[\ell(\boldsymbol{\lambda}; \mathbf{g})^2] < \infty$. The Bayesian posterior mean produced by Algorithm 1 converges to
$\arg\min_{\boldsymbol{\lambda}} E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$ strictly faster than standard SGD (supposing it converges to the global
minimum), given by

$$\boldsymbol{\lambda}_{T+1} \leftarrow \boldsymbol{\lambda}_T - \epsilon_T K \nabla \ell(\boldsymbol{\lambda}_T; \mathbf{g}^T) \tag{6}$$

in terms of the asymptotic variance, except when the step size $\epsilon_T$ and the matrix $K$ are chosen
optimally.

In Theorem 2, by asymptotic variance we mean the following: both the sequence of posterior
means and the update sequence from SGD possess versions of the central limit theorem, in
the form $\sqrt{T}(\boldsymbol{\lambda}_T - \boldsymbol{\lambda}^*) \xrightarrow{d} N(0, \Sigma)$ where $\boldsymbol{\lambda}^* = \arg\min_{\boldsymbol{\lambda}} E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$. Our comparison is on the
asymptotic covariance matrix $\Sigma$ with respect to matrix inequality: for two update schemes
with corresponding asymptotic covariance matrices $\Sigma_1$ and $\Sigma_2$, Scheme 1 converges faster
than Scheme 2 if $\Sigma_2 - \Sigma_1$ is positive definite.

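Proof The proof follows by combining (4) with established central limit theorems for
sample average approximation (Pasupathy and Kim, 2011) and stochastic gradient descent
(SGD) algorithms. First, let $z(\boldsymbol{\lambda}) := E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$, and $\boldsymbol{\lambda}^* := \arg\min_{\boldsymbol{\lambda}} z(\boldsymbol{\lambda})$. Note that the
quantity $\boldsymbol{\lambda}_T^*$ is the minimizer of $\frac{1}{T}\ell_0(\boldsymbol{\lambda}) + \frac{1}{T}\sum_{t=1}^T \ell(\boldsymbol{\lambda}; \mathbf{g}^t)$. Then, together with the fact that
$\frac{1}{T}\ell_0(\boldsymbol{\lambda})$ is asymptotically negligible, Theorem 5.9 in Pasupathy and Kim (2011) stipulates that
$\sqrt{T}(\boldsymbol{\lambda}_T^* - \boldsymbol{\lambda}^*) \xrightarrow{d} N(0, \Sigma)$, where

$$\Sigma = \left(\nabla^2 z(\boldsymbol{\lambda}^*)\right)^{-1} Var\!\left(\nabla \ell(\boldsymbol{\lambda}^*; \mathbf{g})\right) \left(\nabla^2 z(\boldsymbol{\lambda}^*)\right)^{-1} \tag{7}$$

and $Var(\cdot)$ denotes the covariance matrix.

Now since $\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T}) = \sum_{t=1}^T \nabla^2 \ell(\boldsymbol{\lambda}_T^*; \mathbf{g}^t)$ and $\frac{1}{T}\sum_{t=1}^T \nabla^2 \ell(\boldsymbol{\lambda}_T^*; \mathbf{g}^t) \to E[\nabla^2 \ell(\boldsymbol{\lambda}^*; \mathbf{g})]$
by the law of large numbers (Durrett, 2010), we have $\nabla^2 L_T(\boldsymbol{\lambda}_T^*; \mathbf{g}^{1:T}) = \Theta(T)$. Then the
bound in (4) implies that $\left|E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}_T^*\right| = o(1/\sqrt{T})$. In other words, the difference
between the posterior mean and $\boldsymbol{\lambda}_T^*$ is of smaller scale than $1/\sqrt{T}$. By Slutsky's Theorem
(Serfling, 2009), this implies that $\sqrt{T}\left(E_{\boldsymbol{\lambda}_T|\mathbf{g}^{1:T}}[\boldsymbol{\lambda}_T] - \boldsymbol{\lambda}^*\right) \xrightarrow{d} N(0, \Sigma)$ also.

On the other hand, for SGD (6), it is known (e.g. Asmussen and Glynn, 2007) that the
optimal step size parameter value is $\epsilon_T = 1/T$ and $K = \left(\nabla^2 z(\boldsymbol{\lambda}^*)\right)^{-1}$, in which case the central
limit theorem for the update $\boldsymbol{\lambda}_T$ will be given by $\sqrt{T}(\boldsymbol{\lambda}_T - \boldsymbol{\lambda}^*) \xrightarrow{d} N(0, \Sigma)$ where $\Sigma$ is exactly
(7). For other choices of step size, either the convergence rate is slower than order $1/\sqrt{T}$ or
the asymptotic variance, denoted by $\tilde{\Sigma}$, is such that $\tilde{\Sigma} - \Sigma$ is positive definite. Therefore,
by comparing the asymptotic variance, the posterior mean always has a faster convergence
unless the step size in SGD is chosen optimally. ■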


To give some intuition from a statistical viewpoint, Theorem 2 arises from two layers
of approximation of our posterior mean to $\boldsymbol{\lambda}^*$. First, thanks to (4), the difference between
the posterior mean and the minimizer of cumulative loss $\boldsymbol{\lambda}_T^*$ (which can be interpreted as the
MAP) decreases at a rate faster than $1/\sqrt{T}$. Second, $\boldsymbol{\lambda}_T^*$ converges to $\boldsymbol{\lambda}^*$ at a rate of order
$1/\sqrt{T}$ with the optimal multiplicative constant. This is equivalent to the observation that
the MAP, much like the maximum likelihood estimator (MLE), is asymptotically efficient
as a statistical estimator.

Putting things in perspective, compared with local methods such as SGD, we have made
an apparently stronger set of assumptions (i.e. Assumption 1), which pays off by allowing
for stronger theoretical guarantees (Theorems 1 and 2). In the next section we describe an
example where a meaningful loss function precisely fits into our framework.

4. A Specific Example 

We now discuss in depth a simple and natural choice of loss function and its corresponding
likelihood function and prior, which are also used in our experiments in Section 5. Consider

$$\ell(\boldsymbol{\lambda}; \mathbf{g}) = \theta \sum_{i=1}^{m} \lambda_i g_i - \sum_{i=1}^{m} \log \lambda_i \tag{8}$$

The motivation for (8) is straightforward: it is the sum of individual losses each weighted
by $\lambda_i$. The extra term $\log \lambda_i$ prevents $\lambda_i$ from approaching zero, the trivial minimizer for
the first term. The parameter $\theta$ specifies the trade-off between the importance of the first
and the second term. This loss function satisfies Assumption 1. In particular, the Hessian
of $L_T$ turns out to not depend on $\mathbf{g}^{1:T}$, therefore all conditions of Assumption 1 can be
verified easily.
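
The claim about the Hessian can be checked by hand: with the initial loss omitted, the cumulative loss built from (8) has partial derivatives $\theta \sum_t g_i^t - T/\lambda_i$, so the minimizer is $\lambda_i = T / (\theta \sum_t g_i^t)$ and the Hessian is diagonal with entries $T/\lambda_i^2$. The short sketch below spells this out numerically; the function names and the toy loss values are assumptions made for illustration only.

```python
import math

def cumulative_loss(lam, loss_vectors, theta):
    """Sum over t of loss (8), with the initial loss omitted."""
    return sum(
        theta * sum(l * g for l, g in zip(lam, g_t)) - sum(math.log(l) for l in lam)
        for g_t in loss_vectors
    )

def closed_form_minimizer(loss_vectors, theta):
    """Solve theta * sum_t g_i^t - T / lambda_i = 0 for each coordinate i."""
    T = len(loss_vectors)
    m = len(loss_vectors[0])
    return [T / (theta * sum(g_t[i] for g_t in loss_vectors)) for i in range(m)]

def hessian_diagonal(lam, T):
    """Diagonal Hessian entries T / lambda_i^2; the losses do not appear."""
    return [T / l ** 2 for l in lam]

losses = [[0.2, 1.0], [0.4, 3.0], [0.6, 2.0]]      # toy g^1, g^2, g^3 for m = 2
theta = 0.5
lam_star = closed_form_minimizer(losses, theta)    # approximately [5.0, 1.0]
diag = hessian_diagonal(lam_star, len(losses))     # approximately [0.12, 3.0]
```

Because the diagonal entries grow linearly in $T$ for bounded weights, the “steepness” condition is plausible at a glance in this example.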

Using the discussion in Section 3.2, we choose the exponential likelihood (note that in
this definition we add an extra constant term $-m \log \theta$ to (8), which does not affect the
minimization in any way)

$$p(\mathbf{g}|\boldsymbol{\lambda}) = \prod_{i=1}^{m} (\theta \lambda_i) \, e^{-\theta \lambda_i g_i}. \tag{9}$$

To facilitate computation, we employ the Gamma prior:

$$p(\boldsymbol{\lambda}) \propto \prod_{i=1}^{m} \lambda_i^{\alpha - 1} e^{-\beta \lambda_i} \tag{10}$$

where $\alpha$ and $\beta$ are the hyper shape and rate parameters. Correspondingly, we pick $\ell_0(\boldsymbol{\lambda}) =
\beta \sum_{i=1}^m \lambda_i - (\alpha - 1) \sum_{i=1}^m \log \lambda_i$. To be concrete, the cumulative loss in (1) (disregarding the
constant terms) is

$$\beta \sum_{i=1}^{m} \lambda_i - (\alpha - 1) \sum_{i=1}^{m} \log \lambda_i + \sum_{t=1}^{T} \left( \theta \sum_{i=1}^{m} \lambda_i g_i^t - \sum_{i=1}^{m} \log \lambda_i \right).$$

Now, under conjugacy of (9) and (10), the posterior distribution of $\boldsymbol{\lambda}$ after $t$ steps is given
by the Gamma distribution

$$p(\boldsymbol{\lambda}|\mathbf{g}^{1:t}) \propto \prod_{i=1}^{m} \lambda_i^{\alpha + t - 1} e^{-\left(\beta + \theta \sum_{s=1}^{t} g_i^s\right) \lambda_i}.$$

Therefore the posterior mean for each $\lambda_i$ is

$$\frac{\alpha + t}{\beta + \theta \sum_{s=1}^{t} g_i^s} \tag{11}$$

We use the following prediction rule at each step:

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) \le \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1) \\ -1 & \text{otherwise} \end{cases} \tag{12}$$

where each $\lambda_i$ is the posterior mean given by (11). For this setup, Algorithm 1 can be cast
as Algorithm 2 below, which is to be implemented in Section 5.


Algorithm 2 Closed-form Bayesian Ensemble

Input: streaming samples $\{(\mathbf{x}^t, y^t)\}_{t=1}^T$
       online weak classifiers $\{c_i^t(\mathbf{x})\}_{i=1}^m$

Initialize: parameter $\theta$ for likelihood (9) and parameters $\alpha, \beta$ for prior (10)
for t = 1 to T do
    $\forall i$, compute $g_i^t = g(c_i^t(\mathbf{x}^t), y^t)$, where $g$ is the logistic loss function
    update the posterior mean of $\boldsymbol{\lambda}$ by (11)
    update the weak classifiers according to the particular choice of online weak classifier
    make prediction by (12) for the next incoming sample
end for
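
The closed-form update lends itself to a very small implementation. The following is a minimal sketch of Algorithm 2, assuming the weak classifiers are available as real-valued scores and that their own online updates happen elsewhere; the class name `ClosedFormBayesianEnsemble`, its method names, and the toy stream are hypothetical and not taken from the paper. Ties in the prediction rule are resolved in favor of $+1$ here.

```python
import math

def logistic_loss(score, y):
    return math.log1p(math.exp(-y * score))

class ClosedFormBayesianEnsemble:
    """Sketch of Algorithm 2: one Gamma posterior per ensemble weight."""

    def __init__(self, m, alpha=1.0, beta=1.0, theta=0.1):
        self.alpha, self.beta, self.theta = alpha, beta, theta
        self.t = 0
        self.loss_sums = [0.0] * m          # running sums  sum_s g_i^s

    def weights(self):
        """Posterior means (11): (alpha + t) / (beta + theta * sum_s g_i^s)."""
        return [(self.alpha + self.t) / (self.beta + self.theta * s)
                for s in self.loss_sums]

    def update(self, losses):
        """Absorb the loss vector g^t of the m weak classifiers."""
        self.t += 1
        self.loss_sums = [s + g for s, g in zip(self.loss_sums, losses)]

    def predict(self, losses_if_pos, losses_if_neg):
        """Rule (12): pick the label whose weighted loss is smaller."""
        lam = self.weights()
        pos = sum(l * g for l, g in zip(lam, losses_if_pos))
        neg = sum(l * g for l, g in zip(lam, losses_if_neg))
        return 1 if pos <= neg else -1

# Toy stream: classifier 0 always scores the right sign, classifier 1 the wrong one.
ens = ClosedFormBayesianEnsemble(m=2)
stream = [([2.0, -1.0], 1), ([-1.5, 0.5], -1), ([1.0, -2.0], 1)]
for scores, y in stream:
    y_hat = ens.predict([logistic_loss(s, 1) for s in scores],
                        [logistic_loss(s, -1) for s in scores])
    ens.update([logistic_loss(s, y) for s in scores])
```

Only the running loss sums and the step counter need to be stored, so each step costs $O(m)$ time and memory.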


The following bound provides further understanding of the loss function (8) and the
prediction rule (12), by relating their use with a guarantee on the prediction error:

Theorem 3 Suppose that $\mathbf{g}^t$ are i.i.d., so that $\boldsymbol{\lambda}_T^*$ converges to $\boldsymbol{\lambda}^* := \arg\min_{\boldsymbol{\lambda}} E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$
for $\ell$ defined in (8). The prediction error using rule (12) with $\boldsymbol{\lambda}^*$ is bounded by

$$P_{(\mathbf{x},y)}(\text{error}) \le m^{1/p} \left( E_{(\mathbf{x},y)} \left[ \left( \sum_{i=1}^{m} \frac{g_i(\mathbf{x}, -y)}{E_{(\mathbf{x},y)}[g_i(\mathbf{x}, y)]} \right)^{-\frac{1}{p-1}} \right] \right)^{\frac{p-1}{p}} \tag{13}$$

for any $p > 1$.

To make sense of this result, note that the quantity $g_i(\mathbf{x}, -y) / E_{(\mathbf{x},y)}[g_i(\mathbf{x}, y)]$ can be interpreted as
a performance indicator of each weak classifier, i.e. the larger it is, the better the weak
classifier is, since a good classifier should have a small loss $E[g_i(\mathbf{x}, y)]$ and correspondingly
a large $g_i(\mathbf{x}, -y)$. As long as there exist some good weak classifiers among the $m$ choices,
$\sum_{i=1}^m g_i(\mathbf{x}, -y) / E[g_i(\mathbf{x}, y)]$ will be large, which leads to a small error bound in (13).
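
As a sanity check of how the bound (13) behaves, the sketch below evaluates its right-hand side with every expectation replaced by a sample average, next to the error rate of rule (12) with weights proportional to $1/E[g_i(\mathbf{x}, y)]$. The synthetic scores, the use of logistic loss, and the function names are assumptions for illustration. Since the argument behind the bound applies to any distribution with strictly positive losses, it also applies to the empirical distribution of the sample, which is what the code uses.

```python
import math
import random

def weighted(lam, row):
    return sum(l * g for l, g in zip(lam, row))

def bound_and_error(g_true, g_flip, labels, p):
    """g_true[n][i] = g_i(x_n, y_n), g_flip[n][i] = g_i(x_n, -y_n).

    Returns (right-hand side of (13), error rate of rule (12)), both computed
    on the sample. theta only rescales all weights, so it is dropped.
    """
    n, m = len(g_true), len(g_true[0])
    mean_true = [sum(row[i] for row in g_true) / n for i in range(m)]
    lam = [1.0 / mu for mu in mean_true]
    inner = [weighted(lam, row) ** (-1.0 / (p - 1.0)) for row in g_flip]
    bound = m ** (1.0 / p) * (sum(inner) / n) ** ((p - 1.0) / p)
    mistakes = 0
    for gt, gf, y in zip(g_true, g_flip, labels):
        s_true, s_flip = weighted(lam, gt), weighted(lam, gf)
        s_pos, s_neg = (s_true, s_flip) if y == 1 else (s_flip, s_true)
        prediction = 1 if s_pos <= s_neg else -1        # rule (12)
        mistakes += prediction != y
    return bound, mistakes / n

# Synthetic weak classifiers of decreasing quality (hypothetical data).
rng = random.Random(0)
logistic = lambda score, y: math.log1p(math.exp(-y * score))
g_true, g_flip, labels = [], [], []
for _ in range(500):
    y = rng.choice([1, -1])
    scores = [y * strength + rng.gauss(0.0, 1.0) for strength in (2.0, 1.0, 0.0)]
    g_true.append([logistic(s, y) for s in scores])
    g_flip.append([logistic(s, -y) for s in scores])
    labels.append(y)
bound, error = bound_and_error(g_true, g_flip, labels, p=2.0)
```

Varying `p` trades off the two factors in (13); the sketch makes it easy to scan a few values.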


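Proof Suppose $\boldsymbol{\lambda}$ is used in the strong classifier (12). Denote $I(\cdot)$ as the indicator function.
Consider

$$\begin{aligned}
E_{(\mathbf{x},y)}\Big[\sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, y)\Big]
&= \int \Big( \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) P(y = 1|\mathbf{x}) + \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1) P(y = -1|\mathbf{x}) \Big) dP(\mathbf{x}) \\
&\ge \int \Big( I\Big(\sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) > \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1)\Big) \cdot \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) P(y = 1|\mathbf{x}) \\
&\qquad + I\Big(\sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) \le \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1)\Big) \cdot \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1) P(y = -1|\mathbf{x}) \Big) dP(\mathbf{x}) \\
&\ge \int \Big( I\Big(\sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) > \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1)\Big) \cdot \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1) P(y = 1|\mathbf{x}) \\
&\qquad + I\Big(\sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) \le \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -1)\Big) \cdot \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, 1) P(y = -1|\mathbf{x}) \Big) dP(\mathbf{x}) \\
&= E_{(\mathbf{x},y)}\Big[ I(\text{error}) \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -y) \Big] \\
&\ge P(\text{error})^{p} \left( E_{(\mathbf{x},y)}\Big[ \Big( \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -y) \Big)^{-\frac{1}{p-1}} \Big] \right)^{-(p-1)}
\end{aligned}$$

where the last inequality holds by the reverse Hölder inequality (Hardy et al., 1952). So

$$P(\text{error}) \le \left( E_{(\mathbf{x},y)}\Big[ \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, y) \Big] \right)^{\frac{1}{p}} \left( E_{(\mathbf{x},y)}\Big[ \Big( \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x}, -y) \Big)^{-\frac{1}{p-1}} \Big] \right)^{\frac{p-1}{p}}$$

and the result (13) follows by plugging in $\lambda_i = \frac{1}{\theta E_{(\mathbf{x},y)}[g_i(\mathbf{x}, y)]}$ for each $i$, the minimizer of
$E[\ell(\boldsymbol{\lambda}; \mathbf{g})]$, which can be solved directly when $\ell$ is in the form (8). ■

Finally, in correspondence to Theorem 2, the standard SGD for (8) is written as

$$\lambda_i^{t+1} = \lambda_i^{t} - \frac{\gamma}{t} \left( \theta g_i^t - \frac{1}{\lambda_i^t} \right) \tag{14}$$

where $\gamma$ is a parameter that controls the step size. The following result is a consequence of
Theorem 2 (we give another proof here that reveals more specific details).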


Theorem 4 Suppose that $\mathbf{g}^t$ are i.i.d., and $0 < E_{(\mathbf{x},y)}[g_i(\mathbf{x}, y)] < \infty$ and $Var_{(\mathbf{x},y)}(g_i(\mathbf{x}, y)) <
\infty$. For each $\lambda_i$, the posterior mean given by (11) always has a rate of convergence at least
as fast as the SGD update (14) in terms of asymptotic variance. In fact, it is strictly better
in all situations except when the step size parameter $\gamma$ in (14) is set optimally a priori.


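Proof Since for each $i$, $g_i^t$ are i.i.d., the sample mean $(1/T) \sum_{t=1}^{T} g_i^t$ follows a central limit
theorem. It can be argued using the delta method (Serfling, 2009) that the posterior mean
(11) satisfies

$$\sqrt{T} \left( \frac{\alpha + T}{\beta + \theta \sum_{t=1}^{T} g_i^t} - \frac{1}{\theta E[g_i(\mathbf{x}, y)]} \right) \xrightarrow{d} N\!\left( 0, \frac{Var(g_i(\mathbf{x}, y))}{\theta^2 (E[g_i(\mathbf{x}, y)])^4} \right) \tag{15}$$

For the stochastic gradient descent scheme (14), it would be useful to cast the objective
function as $z_i(\lambda_i) = E[\theta \lambda_i g_i - \log \lambda_i]$. Let $\lambda_i^* = \arg\min_{\lambda} z_i(\lambda)$, which can be directly
solved as $\frac{1}{\theta E[g_i(\mathbf{x}, y)]}$. Then $z_i''(\lambda_i^*) = \frac{1}{\lambda_i^{*2}} = \theta^2 (E[g_i(\mathbf{x}, y)])^2$. If the step size $\gamma > \frac{1}{2 z_i''(\lambda_i^*)}$,
the update scheme (14) will generate $\lambda_i^T$ that satisfies the following central limit theorem
(Asmussen and Glynn, 2007; Kushner and Yin, 2003)

$$\sqrt{T}(\lambda_i^T - \lambda_i^*) \xrightarrow{d} N(0, \sigma_i^2) \tag{16}$$

where

$$\sigma_i^2 = \int_0^\infty e^{(1 - 2\gamma z_i''(\lambda_i^*)) s} \, \gamma^2 \, Var\!\left( \theta g_i(\mathbf{x}, y) - \frac{1}{\lambda_i^*} \right) ds \tag{17}$$

and $\theta g_i(\mathbf{x}, y) - \frac{1}{\lambda_i^*}$ is the unbiased estimate of the gradient at the point $\lambda_i^*$. On the other hand,
$\lambda_i^T - \lambda_i^* = \omega_p(1/\sqrt{T})$ if $\gamma \le \frac{1}{2 z_i''(\lambda_i^*)}$, i.e. the convergence is slower than (16) asymptotically
and so we can disregard this case (Asmussen and Glynn, 2007). Now substitute $\lambda_i^* = \frac{1}{\theta E[g_i(\mathbf{x}, y)]}$
into (17) to obtain

$$\sigma_i^2 = \theta^2 \gamma^2 Var(g_i(\mathbf{x}, y)) \int_0^\infty e^{(1 - 2\gamma / \lambda_i^{*2}) s} \, ds = \frac{\theta^2 \gamma^2 Var(g_i(\mathbf{x}, y))}{2\gamma / \lambda_i^{*2} - 1} = \frac{\theta^2 \gamma^2 Var(g_i(\mathbf{x}, y))}{2\gamma \theta^2 (E[g_i(\mathbf{x}, y)])^2 - 1}$$

and letting $\gamma = \tilde{\gamma} / \theta^2$, we get

$$\sigma_i^2 = \frac{\tilde{\gamma}^2 \, Var(g_i(\mathbf{x}, y))}{\theta^2 \left( 2\tilde{\gamma} (E[g_i(\mathbf{x}, y)])^2 - 1 \right)} \tag{18}$$

if $\tilde{\gamma} > \frac{\theta^2}{2 z_i''(\lambda_i^*)} = \frac{1}{2 (E[g_i(\mathbf{x}, y)])^2}$.

We are now ready to compare the asymptotic variance in (15) and (18), and show that
for all $\tilde{\gamma}$, the one in (15) is smaller. Note that this is equivalent to showing that

$$\frac{Var(g_i(\mathbf{x}, y))}{\theta^2 (E[g_i(\mathbf{x}, y)])^4} \le \frac{\tilde{\gamma}^2 \, Var(g_i(\mathbf{x}, y))}{\theta^2 \left( 2\tilde{\gamma} (E[g_i(\mathbf{x}, y)])^2 - 1 \right)}$$

Eliminating the common factors, we have

$$\frac{1}{(E[g_i(\mathbf{x}, y)])^2} \le \frac{\tilde{\gamma}^2}{2\tilde{\gamma} - 1/(E[g_i(\mathbf{x}, y)])^2}$$

and by re-arranging the terms, we have

$$(E[g_i(\mathbf{x}, y)])^2 \left( \tilde{\gamma} - \frac{1}{(E[g_i(\mathbf{x}, y)])^2} \right)^2 \ge 0$$

which is always true. Equality holds iff $\tilde{\gamma} = \frac{1}{(E[g_i(\mathbf{x}, y)])^2}$, which corresponds to $\gamma = \frac{1}{\theta^2 (E[g_i(\mathbf{x}, y)])^2}$.
Therefore, the asymptotic variance in (15) is always smaller than (18), unless the step size
$\gamma$ is chosen optimally. ■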
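
The comparison in the proof can be reproduced numerically. The sketch below implements one coordinate of the SGD update as we read (14), together with the two asymptotic variances (15) and (18), so that the gap can be inspected for different values of the rescaled step size. The function names and the numbers plugged in for the mean and variance of the loss are hypothetical.

```python
def sgd_step(lam, g, t, gamma, theta):
    """One coordinate of update (14): lam - (gamma / t) * (theta * g - 1 / lam)."""
    return lam - (gamma / t) * (theta * g - 1.0 / lam)

def posterior_mean_variance(mean_g, var_g, theta):
    """Asymptotic variance of the posterior mean, as in (15)."""
    return var_g / (theta ** 2 * mean_g ** 4)

def sgd_variance(mean_g, var_g, theta, gamma_tilde):
    """Asymptotic variance of SGD, as in (18), with gamma_tilde = theta^2 * gamma."""
    denom = 2.0 * gamma_tilde * mean_g ** 2 - 1.0
    if denom <= 0:
        raise ValueError("the CLT (16) needs gamma_tilde > 1 / (2 * mean_g ** 2)")
    return gamma_tilde ** 2 * var_g / (theta ** 2 * denom)

# Hypothetical moments of a weak classifier's loss.
mean_g, var_g, theta = 0.5, 0.04, 0.1
optimal_gamma_tilde = 1.0 / mean_g ** 2                  # equality case in the proof
v_bayes = posterior_mean_variance(mean_g, var_g, theta)
v_sgd_optimal = sgd_variance(mean_g, var_g, theta, optimal_gamma_tilde)
v_sgd_off = sgd_variance(mean_g, var_g, theta, 0.75 * optimal_gamma_tilde)
```

The optimal step size depends on the unknown mean of the loss, which is why it cannot be fixed a priori in the online setting.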


5. Experiments 


We report two sets of binary classification experiments in the online learning setting. In the
first set of experiments, we evaluate our scheme's performance against five baseline methods: a
single baseline classifier, a uniform voting ensemble, and three SGD-based online ensemble
learning methods. In the second set of experiments, we compare with three leading online
boosting methods: GradientBoost (Leistner et al., 2009), Smooth-Boost (Chen et al., 2012),
and the online boosting method of Oza and Russell (2001).

In all experiments, we follow the experimental setup in Chen et al. (2012). Data arrives
as a sequence of examples $(x_1, y_1), \ldots, (x_T, y_T)$. At each step $t$ the online learner
predicts the class label for $x_t$, then the true label $y_t$ is revealed and used to update the
classifier online. We report the averaged error rate for each evaluated method over five trials
of different random orderings of each dataset. The experiments are conducted for two different
choices of weak classifiers: Perceptron and Naive Bayes.
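The predict-then-update protocol and the averaging over random orderings can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `predict`/`update` interface, the function names, and the toy `MajorityLearner` are all assumptions introduced here.

```python
import random


def online_error_rate(learner, examples):
    """One pass over the sequence: predict the label first, then reveal it and update."""
    mistakes = 0
    for x, y in examples:
        if learner.predict(x) != y:
            mistakes += 1
        learner.update(x, y)
    return mistakes / len(examples)


def mean_error_over_orderings(make_learner, examples, n_trials=5, seed=0):
    """Average the online error rate over n_trials random orderings of the data."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        order = list(examples)
        rng.shuffle(order)
        # A fresh learner is created for every ordering.
        rates.append(online_error_rate(make_learner(), order))
    return sum(rates) / n_trials


class MajorityLearner:
    """Toy stand-in learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.balance = 0

    def predict(self, x):
        return 1 if self.balance >= 0 else -1

    def update(self, x, y):
        self.balance += y
```

Any learner exposing the same two methods could be passed as `make_learner`.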

In all experiments, we choose the loss function $g$ of our method to be the ramp loss,
and set the hyperparameters of our method as $\alpha = \beta = 1$ and $\theta = 0.1$. From the
expression of the posterior mean (11), the prediction rule (12) is unrelated to the values of
$\alpha$, $\beta$ and $\theta$ in the long term. We have observed that the classification
performance of our method is not very sensitive to changes in the settings of these parameters.
However, the stochastic gradient descent baseline (SGD) (14) is sensitive to the setting of
$\theta$, and since $\theta = 0.1$ works best for SGD we also use $\theta = 0.1$ for our method.


5.1 Comparison with Baseline Methods 

In the experimental evaluation, we compare our online ensemble method with five baseline 
methods: 




1. a single weak classifier (Perceptron or Naive Bayes), 

2. a uniform ensemble of weak classifiers (Voting), 

3. an ensemble of weak classifiers where the ensemble weights are estimated via standard 
stochastic gradient descent (SGD), 


4. a variant of (3.) where the ensemble weights are estimated via Polyak averaging
(Polyak and Juditsky, 1992) (SGD-avg), and

5. another variant of (3.) where the ensemble weights are estimated via the Stochastic
Average Gradient method of Schmidt et al. (2013) (SAG).
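To make baselines (3.) and (4.) concrete, the sketch below runs scalar SGD with step size $\gamma/t$ on the per-coordinate objective whose stochastic gradient is $\theta g - 1/\lambda$ (the gradient form used in the convergence proof), and also returns the running average of the iterates. It is a hedged illustration: the function name, the starting point `lam0`, and the projection floor `lam_min` are assumptions, and Polyak averaging is usually paired with a more slowly decaying step size than the $\gamma/t$ schedule used here.

```python
def sgd_with_polyak_average(g_stream, theta, gamma, lam0=1.0, lam_min=1e-8):
    """Scalar SGD with step gamma/t; returns (last iterate, average of all iterates)."""
    lam = lam0
    running_sum = 0.0
    count = 0
    for t, g in enumerate(g_stream, start=1):
        # Stochastic gradient of theta * lam * g - log(lam) with respect to lam.
        grad = theta * g - 1.0 / lam
        # Projection onto [lam_min, inf) keeps the weight positive (an assumption of this sketch).
        lam = max(lam - (gamma / t) * grad, lam_min)
        running_sum += lam
        count += 1
    return lam, running_sum / count
```

With a constant loss stream `g = 0.5` and `theta = 1.0`, the minimizer is $1/(\theta g) = 2$, and `gamma = 4.0` satisfies the step-size condition $\gamma > 1/(2 z''(\lambda^*)) = 2$.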


We use ten binary classification benchmark datasets obtained from the LIBSVM repository
(footnote 1). Each dataset is split into training and testing sets for each random trial, where
a training set contains no more than 10% of the total amount of data available for that
particular benchmark. For each experimental trial, the ordering of items in the testing sequence
is selected at random, and each online classifier ensemble learning method is presented with
the same testing data sequence for that trial.

In each experimental trial, for all ensemble learning methods, we utilize a set of 100 
pre-trained weak classifiers that are kept static during the online learning process. The 
training set is used in learning these 100 weak classifiers. The same weak classifiers are 
then shared by all of the ensemble methods, including our method. In order to make weak 
classifiers divergent, each weak classifier uses a randomly sampled subset of data features 
as input for both training and testing. The first baseline (single classifier) is learned using 
all the features. 
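The random feature subsets used to diversify the weak classifiers could be drawn as in the sketch below. The subset size is not stated in the text, so `subset_size` (defaulting to half of the features) is purely a placeholder, as are the function name and the seed handling.

```python
import random


def sample_feature_subsets(n_features, n_classifiers=100, subset_size=None, seed=0):
    """Draw one random subset of feature indices per weak classifier."""
    rng = random.Random(seed)
    if subset_size is None:
        # Placeholder default; the paper does not specify the subset size.
        subset_size = max(1, n_features // 2)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_classifiers)]
```

Each weak classifier would then be trained and evaluated only on the columns listed in its subset, while the single-classifier baseline keeps all features.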

For all of the benchmarks we observed that the error rate varies with different orderings
of the dataset. Therefore, following Chen et al. (2012), we report the average error rate
over five random trials of different orders of each sequence. In fact, while the error rate
may vary according to different orderings of a dataset, it was observed throughout all our
experiments that the ranking of performance among different methods is usually consistent.

Classification error rates for this experiment are shown in Tables 1 and 2. Our proposed
method consistently performs the best for all datasets. Its superior performance
against Voting is consistent with the asymptotic convergence analysis in Theorem 1. Its
superior performance against the SGD baseline is consistent with the convergence rate
analysis in Theorem 2. Polyak averaging (SGD-avg) does not improve the performance
of basic SGD in general; this is consistent with the analysis in Xu (2011) which showed
that, despite its optimal asymptotic convergence rate, a huge number of samples may be
needed for Polyak averaging to reach its asymptotic region for a randomly chosen step size.

SAG (Schmidt et al., 2013) is a close runner-up to our approach, but it has two limitations:
1) it requires knowing the length of the testing sequence a priori, and 2) as noted in
Schmidt et al. (2013), the step size suggested in the theoretical analysis does not usually
give the best result in practice, and thus the authors suggest a larger step size instead.
In our experiments, we also found that the improvement of Schmidt et al. (2013) over the
SGD baseline relies on tuning the step size to a value that is greater than that given in the
theory. The performance of SAG reported here has taken advantage of these two points.


1. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
Table 1: Experiments of online classifier ensemble using pre-trained Perceptrons as weak
classifiers and keeping them fixed online. Mean error rate over five random trials
is shown in the table. We compare with five baseline methods: a single Perceptron
classifier (Perceptron), a uniform ensemble scheme of weak classifiers (Voting),
an ensemble scheme using SGD for estimating the ensemble weights (SGD), an
ensemble scheme using the Polyak averaging scheme of SGD (Polyak and Juditsky,
1992) to estimate the ensemble weights (SGD-avg), and an ensemble scheme using
the Stochastic Average Gradient (Schmidt et al., 2013) to estimate the ensemble
weights (SAG). Our method attains the top performance for all testing sequences.

Dataset        # Examples  Perceptron  Voting  SGD    SGD-avg  SAG    Ours
Heart          270         0.258       0.268   0.265  0.266    0.245  0.239
Breast-Cancer  683         0.068       0.056   0.056  0.055    0.055  0.050
Australian     693         0.204       0.193   0.186  0.187    0.171  0.166
Diabetes       768         0.389       0.373   0.371  0.372    0.364  0.363
German         1000        0.388       0.324   0.321  0.323    0.315  0.309
Splice         3175        0.410       0.349   0.335  0.338    0.301  0.299
Mushrooms      8124        0.058       0.034   0.034  0.034    0.031  0.030
Ionosphere     351         0.297       0.247   0.240  0.241    0.240  0.236
Sonar          208         0.404       0.379   0.376  0.379    0.370  0.369
SVMguide3      1284        0.382       0.301   0.299  0.299    0.292  0.289


Table 2: Experiments of online classifier ensemble using pre-trained Naive Bayes as weak
classifiers and keeping them fixed online. Mean error rate over five random trials is
shown in the table. We compare with five baseline methods: a single Naive Bayes
classifier (Naive Bayes), a uniform ensemble scheme of weak classifiers (Voting),
an ensemble scheme using SGD for estimating the ensemble weights (SGD), an
ensemble scheme using the Polyak averaging scheme of SGD (Polyak and Juditsky,
1992) to estimate the ensemble weights (SGD-avg), and an ensemble scheme using
the Stochastic Average Gradient (Schmidt et al., 2013) to estimate the ensemble
weights (SAG). Our method attains the top performance for all testing sequences.

Dataset        # Examples  Naive Bayes  Voting  SGD    SGD-avg  SAG    Ours
Heart          270         0.232        0.207   0.214  0.215    0.206  0.202
Breast-Cancer  683         0.065        0.049   0.050  0.049    0.048  0.044
Australian     693         0.204        0.201   0.200  0.200    0.187  0.184
Diabetes       768         0.259        0.258   0.256  0.256    0.254  0.253
German         1000        0.343        0.338   0.338  0.338    0.320  0.315
Splice         3175        0.155        0.156   0.155  0.155    0.152  0.152
Mushrooms      8124        0.037        0.066   0.064  0.064    0.046  0.031
Ionosphere     351         0.199        0.196   0.195  0.195    0.193  0.192
Sonar          208         0.338        0.337   0.337  0.337    0.337  0.336
SVMguide3      1284        0.315        0.316   0.304  0.316    0.236  0.215
[Figure 1 panels: Mushrooms - Perceptron, Mushrooms - Naive Bayes, Breast-Cancer - Naive Bayes,
Australian - Perceptron, Australian - Naive Bayes; horizontal axis: number of samples.]

Figure 1: Plots of the error rate as online learning progresses for three benchmark
datasets: Mushrooms, Breast-Cancer, and Australian. (Plots for other benchmark
datasets are provided in the supporting material.) The red curve in each
graph shows the error rate for our method, as a function of the number of samples
processed in the online learning of ensemble weights. The cyan curves are results
from the SGD baseline, the green curves are results from the Polyak averaging
baseline SGD-avg (Polyak and Juditsky, 1992), and the blue curves are results from
the Stochastic Average Gradient baseline SAG (Schmidt et al., 2013).
Figure 2: Experiments to evaluate different settings of $\beta$ for our online classifier
ensemble method, using pre-trained Perceptrons and Naive Bayes as weak classifiers.
The mean error rate is computed over five random trials for the "Heart" and
"Mushrooms" datasets. These results are consistent with all other benchmarks
tested.

Figure 3: Experiments to evaluate different settings of $\theta$ for our online classifier
ensemble method, using pre-trained Perceptrons and Naive Bayes as weak classifiers.
The mean error rate is computed over five random trials for the "Heart" and
"Mushrooms" datasets. These results are consistent with all other benchmarks
tested.


Fig. 1 shows plots of the convergence of online learning for three of the benchmark
datasets. Plots for the other benchmark datasets are provided in the supplementary
material. Each plot reports the classification error curves of our method, the SGD baseline,
Polyak averaging SGD-avg (Polyak and Juditsky, 1992), and Stochastic Average Gradient
SAG (Schmidt et al., 2013). Overall, for all methods, the error rate generally tends to
decrease as the online learning process considers more and more samples. As is evident in
the graphs, our method tends to attain the lowest error rates among the compared methods
throughout each training sequence for these benchmarks. Ideally, as an algorithm
converges, the rate of cumulative error should tend to decrease as more samples are processed,
approaching the minimal error rate that is achievable for the given set of pre-trained weak
classifiers. Yet given the finite size of the training sample set, and the randomness caused by
different orderings of the sequences, we may not see the ideal monotonic curves. But in
general, the trend of curves obtained by our method is consistent with the convergence
analysis of Theorem 1. The online learning algorithm that converges faster should result
in curves that go down more quickly in general. Again, given finite samples and different
orderings, there is variance, but still, consistent with Theorem 2, the consistently better
performance of our formulation vs. the compared methods is evident.

Fig. 2 and Fig. 3 show plots for studying the sensitivity of parameter settings of our
method. It is clear from the expression of the posterior mean (11) that the numerator
containing $\alpha$ will be cancelled out in the prediction rule (12), therefore we just need to
study the effect of $\beta$ and $\theta$. We select a short sequence, "Heart", and a long sequence,
"Mushrooms", as two representative datasets. We plot the classification error rates of our
method under different settings of $\beta$ (Fig. 2) and $\theta$ (Fig. 3), averaged over five random
trials. It can be observed that the performance of our method is not very sensitive with
respect to the changes in the settings of $\beta$ and $\theta$ even for a short sequence like "Heart"
(270 samples). And the performance is more stable to the settings of these parameters for
a longer sequence like "Mushrooms" (8124 samples). This observation is consistent with the
asymptotic property of our prediction rule (12). We observed similar behavior for all the
other benchmark datasets we tested.
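The cancellation of $\alpha$ can be illustrated with a short sketch. Equations (11) and (12) are not restated in this section, so the code assumes the posterior mean has the form $\lambda_i = (\alpha + t) / (\beta + \theta \sum_{s \le t} g_i^s)$ and that the prediction rule compares the weighted losses under the two candidate labels; both forms, and all names below, are assumptions made for illustration.

```python
class BayesianEnsembleWeights:
    """Sketch of a recursive posterior-mean weight update (assumed form of (11))."""

    def __init__(self, n_classifiers, alpha=1.0, beta=1.0, theta=0.1):
        self.alpha, self.beta, self.theta = alpha, beta, theta
        self.t = 0
        self.loss_sums = [0.0] * n_classifiers

    def update(self, losses):
        """Accumulate the losses g_i observed for the newly revealed label."""
        self.t += 1
        for i, g in enumerate(losses):
            self.loss_sums[i] += g

    def posterior_mean(self):
        # The numerator (alpha + t) is shared by every classifier.
        return [(self.alpha + self.t) / (self.beta + self.theta * s)
                for s in self.loss_sums]

    def predict(self, losses_if_pos, losses_if_neg):
        """Return +1 if the weighted loss under label +1 is no larger than under -1."""
        lam = self.posterior_mean()
        score_pos = sum(l * g for l, g in zip(lam, losses_if_pos))
        score_neg = sum(l * g for l, g in zip(lam, losses_if_neg))
        return 1 if score_pos <= score_neg else -1
```

Because the factor $(\alpha + t)$ multiplies both sides of the comparison in `predict`, changing `alpha` rescales all weights equally and leaves the predicted label unchanged.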


5.2 Comparison with Online Boosting Methods 


We further compare our method with a single Perceptron/Naive Bayes classifier that is
updated online, and three representative online boosting methods reported in Chen et al.
(2012): OzaBoost is the method proposed by Oza and Russell (2001), OGBoost is the
online GradientBoost method proposed by Leistner et al. (2009), and OSBoost is the
online Smooth-Boost method proposed by Chen et al. (2012). Ours-R is our proposed
Bayesian ensemble method for online updated weak classifiers. All methods are trained and
compared following the setup of Chen et al. (2012), where for each experimental trial, a set
of 100 weak classifiers are initialized and updated online.


We use ten binary classification benchmark datasets that are also used by Chen et al.
(2012). We discard the "Ijcnn1" and "Web Page" datasets from the tables of Chen et al.
(2012), because they are highly biased with portions of positive samples around 0.09 and
0.03 respectively, and even a naive "always negative" classifier attains comparably top
performance.
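The remark about the "always negative" classifier is simple arithmetic: its error rate equals the fraction of positive samples, so it would be about 0.09 and 0.03 on the two discarded datasets. A tiny sketch, with a hypothetical function name and labels assumed to be encoded as +1/-1:

```python
def always_negative_error(labels):
    """Error rate of a classifier that predicts -1 for every example."""
    return sum(1 for y in labels if y == 1) / len(labels)
```

For instance, a sequence with 9 positive labels out of 100 gives an error rate of 0.09.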


The error rates for this experiment are shown in Tables 3 and 4. As can be seen, our
method outperforms competing methods using the Perceptron weak classifier in nearly all
the benchmarks tested. Moreover, our method performs among the best for the Naive Bayes
weak classifier. It is worth noting that our method is the only one that outperforms the
single classifier baseline in all benchmark datasets, which further confirms the effectiveness
of the proposed ensemble scheme.
Table 3: Experiments of online classifier ensemble using online Perceptrons as weak
classifiers that are updated online. Mean error rate over five trials is shown in the
table. We compare with a single online Perceptron classifier (Perceptron) and
three representative online boosting methods reported in Chen et al. (2012).
OzaBoost is the method proposed by Oza and Russell (2001), OGBoost is the
online GradientBoost method proposed by Leistner et al. (2009), and OSBoost
is the online Smooth-Boost method proposed by Chen et al. (2012). Our method
(Ours-R) attains the top performance for most of the testing sequences.

Dataset        # Examples  Perceptron  OzaBoost  OGBoost  OSBoost  Ours-R
Heart          270         0.2489      0.2356    0.2267   0.2356   0.2134
Breast-Cancer  683         0.0592      0.0501    0.0445   0.0466   0.0419
Australian     693         0.2099      0.2012    0.1962   0.1872   0.1655
Diabetes       768         0.3216      0.3169    0.3313   0.3185   0.3098
German         1000        0.3256      0.3364    0.3142   0.3148   0.3105
Splice         3175        0.2717      0.2759    0.2625   0.2605   0.2584
Mushrooms      8124        0.0148      0.0080    0.0068   0.0060   0.0062
Adult          48842       0.2093      0.2045    0.2080   0.1994   0.1682
Cod-RNA        488565      0.2096      0.2170    0.2241   0.2075   0.1934
Covertype      581012      0.3437      0.3449    0.3482   0.3334   0.3115


Table 4: Experiments of online classifier ensemble using online Naive Bayes as weak
classifiers that are updated online. Mean error rate over five trials is shown in the
table. We compare with a single online Naive Bayes classifier (Naive Bayes)
and three representative online boosting methods reported in Chen et al. (2012).
OzaBoost is the method proposed by Oza and Russell (2001), OGBoost is the
online GradientBoost method proposed by Leistner et al. (2009), and OSBoost
is the online Smooth-Boost method proposed by Chen et al. (2012). Our method
(Ours-R) attains the top performance for 7 out of 10 testing sequences. For
"Cod-RNA" our implementation of the Naive Bayes baseline was unable to duplicate
the reported result; ours gave 0.2555 instead.

Dataset        # Examples  Naive Bayes  OzaBoost  OGBoost  OSBoost  Ours-R
Heart          270         0.1904       0.2570    0.3037   0.2059   0.1755
Breast-Cancer  683         0.0474       0.0635    0.1004   0.0489   0.0408
Australian     693         0.1751       0.2133    0.2826   0.1849   0.1611
Diabetes       768         0.2664       0.3091    0.3292   0.2622   0.2467
German         1000        0.2988       0.3206    0.3598   0.2730   0.2667
Splice         3175        0.2520       0.1563    0.1863   0.1370   0.1344
Mushrooms      8124        0.0076       0.0049    0.0229   0.0029   0.0054
Adult          48842       0.2001       0.1912    0.1878   0.1581   0.1658
Cod-RNA        488565      0.2206*      0.0796    0.0568   0.0581   0.2552
Covertype      581012      0.3518       0.3293    0.3732   0.3634   0.3269
We also note that despite our best efforts to align both the weak classifier construction
and experimental setup with competing methods (Chen et al., 2012; Chen, 2013), there
are inevitably differences in weak classifier construction. Firstly, given that our method
only focuses on optimizing the ensemble weights, each incoming sample is treated equally
in the update of all weak classifiers, while all three online boosting methods adopt more
sophisticated weighted update schemes for the weak classifiers, where the sample weight
is dynamically adjusted during each round of update. Secondly, in order to make weak
classifiers different from each other, our weak classifiers use only a subset of input features,
while weak classifiers of competing methods use all features and are updated differently.
As a result, the weak classifiers used by our method are actually weaker than in competing
methods. Nevertheless, our method often compares favorably.


6. Additional Loss Functions for Online Ensemble Learning 

We discuss other loss functions that fit into our Bayesian online ensemble learning
framework. Note that the loss function (8) given in Section 4 is very simple, to the extent that
the surrogate empirical loss (1) at each step can be directly minimized in closed form. To
demonstrate the flexibility of our framework, the empirical losses in the two examples we
give below cannot be minimized directly, but they are still effectively solvable using our
approach.

1. Consider the loss function

$$\ell(\lambda; g) = \sum_{i=1}^m (1 - \lambda_i) \log g_i + \theta \sum_{i=1}^m g_i + \sum_{i=1}^m \log \Gamma(\lambda_i) - (\log \theta) \sum_{i=1}^m \lambda_i \tag{19}$$

where $\theta > 0$ is a fixed parameter. The corresponding likelihood is given by the
following product of Gamma distributions

$$p(g \mid \lambda) = \prod_{i=1}^m \frac{\theta^{\lambda_i}}{\Gamma(\lambda_i)}\, g_i^{\lambda_i - 1} e^{-\theta g_i}$$

A conjugate prior for $\lambda$ is available, in the form

$$p(\lambda) \propto \prod_{i=1}^m \frac{a^{\lambda_i - 1}\, \theta^{c \lambda_i}}{\Gamma(\lambda_i)^{b}} \tag{20}$$

where $a, b, c > 0$ are hyperparameters. The posterior distribution of $\lambda$ after $t$ steps is
given by

$$p(\lambda \mid g^{1:t}) \propto \prod_{i=1}^m \frac{\left(a \prod_{\tau=1}^t g_i^\tau\right)^{\lambda_i - 1} \theta^{(c + t) \lambda_i}}{\Gamma(\lambda_i)^{b + t}} \tag{21}$$

Note that given posterior (21), the posterior mean for each $\lambda_i$ is not available in closed
form, but it can be computed using standard numerical integration procedures, such as
those provided in the Matlab Mathematics Toolbox (it only involves one-dimensional
procedures because of the independence among the $\lambda_i$). The corresponding prediction
rule at each step is given by

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^m (1 - \lambda_i) \log \frac{g_i(x, 1)}{g_i(x, -1)} + \theta \sum_{i=1}^m \left(g_i(x, 1) - g_i(x, -1)\right) \le 0 \\ -1 & \text{otherwise} \end{cases}$$

Note that the likelihood of $g$ associated with loss (19) is a Gamma distribution, which has
support $(0, \infty)$. For computational convenience, instead of choosing the ramp loss for $g$
as in Section 4, we can choose $g$ to be the logistic function.

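2. We can extend the ensemble weights to include two correlated parameters for each
weight, i.e., $\lambda_i = (\alpha_i, \beta_i)$. In this case, we may define the loss function as

$$\ell(\alpha, \beta; g) = \sum_{i=1}^m \beta_i g_i + \sum_{i=1}^m (1 - \alpha_i) \log g_i + \sum_{i=1}^m \log \Gamma(\alpha_i) - \sum_{i=1}^m \alpha_i \log \beta_i \tag{22}$$

with the corresponding Gamma likelihood

$$p(g \mid \alpha, \beta) = \prod_{i=1}^m \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)}\, g_i^{\alpha_i - 1} e^{-\beta_i g_i}$$

A conjugate prior is available for $\alpha$ and $\beta$ jointly

$$p(\alpha, \beta) \propto \prod_{i=1}^m \frac{p^{\alpha_i - 1} e^{-q \beta_i}}{\Gamma(\alpha_i)^{r}\, \beta_i^{-\alpha_i s}} \tag{23}$$

where $p, q, r, s$ are hyperparameters. The posterior distribution of $\alpha$ and $\beta$ after $t$
steps is given by

$$p(\alpha, \beta \mid g^{1:t}) \propto \prod_{i=1}^m \frac{\left(p \prod_{\tau=1}^t g_i^\tau\right)^{\alpha_i - 1} e^{-\left(q + \sum_{\tau=1}^t g_i^\tau\right) \beta_i}}{\Gamma(\alpha_i)^{r + t}\, \beta_i^{-\alpha_i (s + t)}} \tag{24}$$

Again, the posterior mean for (24) is not available in closed form and we can approximate
it using numerical methods. The corresponding prediction rule at each step is
given by

$$y = \begin{cases} 1 & \text{if } \sum_{i=1}^m (1 - \alpha_i) \log \frac{g_i(x, 1)}{g_i(x, -1)} + \sum_{i=1}^m \beta_i \left(g_i(x, 1) - g_i(x, -1)\right) \le 0 \\ -1 & \text{otherwise} \end{cases}$$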



Note that both of these loss functions satisfy Assumption 1. Similar to the example
proposed in Section 4, the Hessian of $L_T$ turns out to not depend on $g^{1:T}$, therefore all
conditions of Assumption 1 can be verified easily. As a result, applying Algorithm 1 to
these two loss functions for solving the online ensemble learning problem also possesses the
convergence properties given by Theorems 1 and 2.
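The prediction rule for the two-parameter Gamma loss amounts to comparing the loss incurred under label +1 with the loss under label -1; the log-Gamma and log-rate terms do not depend on the label and cancel. The sketch below is an illustration under that reading: the function name is hypothetical, ties are resolved in favor of +1 as an assumption, and the one-parameter rule is obtained by passing the fixed $\theta$ as every rate.

```python
import math


def predict_gamma_loss(shapes, rates, g_pos, g_neg):
    """Return +1 if the Gamma-type loss under label +1 does not exceed that under -1.

    shapes: posterior means of the shape parameters (alpha_i, or lambda_i for one-parameter case)
    rates:  posterior means of the rate parameters (beta_i, or theta repeated)
    g_pos, g_neg: strictly positive weak-classifier losses if the label were +1 / -1
    """
    score = 0.0
    for a, b, gp, gn in zip(shapes, rates, g_pos, g_neg):
        score += (1.0 - a) * math.log(gp / gn) + b * (gp - gn)
    return 1 if score <= 0.0 else -1
```

Since the Gamma likelihood has support $(0, \infty)$, the losses passed in must be strictly positive, which is one reason a logistic-type $g$ is convenient here.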

We follow the experimental setup of Section 5.1 to compare our proposed loss (8) with the
additional losses (19) and (22) discussed here, using pre-trained Perceptron and Naive Bayes
as weak classifiers. The loss function $g$ for weak classifier $c$ is chosen as a logistic function of
$y \cdot c(x)$. According to the posterior update rules given in (21) and (24), the hyperparameters
$b$, $c$ and $r$, $s$ will keep increasing as online learning proceeds. However, we observe in
practice that the numerical integration of posterior means based on posterior distributions (21)
and (24) will not converge if the values of hyperparameters $b, c, r, s$ are too large. In our
experiments, we set upper bounds for these parameters. In particular, we set the upper
bound for $b$ and $c$ as 1000, and the upper bound for $r$ and $s$ as 200.5 and 200 respectively
(since $s$ should be strictly less than $r$, we use the following initialization: $s = 1$, $r = 1.5$, as
suggested by Fink, 1997).
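A one-dimensional posterior mean of this kind can be approximated on a grid. The sketch below assumes the unnormalized log-density of a single shape parameter implied by posterior (21), namely $(\lambda - 1)(\log a + \sum_\tau \log g^\tau) + (c + t)\lambda \log\theta - (b + t)\log\Gamma(\lambda)$, and evaluates it in log space so that large exponents do not overflow. The function name, the truncation point `lam_max`, and the grid size `n_grid` are hypothetical choices, and a plain Riemann sum stands in for the Matlab routines mentioned in the text.

```python
import math


def posterior_mean_shape(log_a, sum_log_g, b, c, t, theta, lam_max=50.0, n_grid=20000):
    """Grid approximation of E[lambda] under the unnormalized density assumed from (21)."""
    h = lam_max / n_grid
    grid = [h * (k + 1) for k in range(n_grid)]  # uniform grid on (0, lam_max]
    log_dens = [
        (lam - 1.0) * (log_a + sum_log_g)
        + (c + t) * lam * math.log(theta)
        - (b + t) * math.lgamma(lam)
        for lam in grid
    ]
    shift = max(log_dens)  # subtract the maximum before exponentiating
    weights = [math.exp(v - shift) for v in log_dens]
    numerator = sum(lam * w for lam, w in zip(grid, weights))
    return numerator / sum(weights)
```

Working with shifted log-densities is one way to keep the integration stable as the hyperparameters grow; capping them, as done in the experiments, is another.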

Averaged classification error rate over five trials for this experiment is shown in Table 5.
Note that the results in this table should not be directly compared with those reported
in Tables 1 and 2, given the loss function $g$ for weak classifiers is chosen differently. We
observe that loss (22) works slightly better than loss (19), which is reasonable given more
parameters in the formula of (22). This advantage also leads to a superior performance
to loss (8) proposed in Section 4 for shorter sequences, such as "Heart", "Ionosphere" and
"Sonar". However, for longer sequences, loss (8) still has some advantage because of the
closed-form posterior mean.


Table 5: Experiments of online classifier ensemble using pre-trained Perceptrons/Naive
Bayes as weak classifiers and keeping them fixed online. Mean error rate over
five random trials is shown in the table. We compare our method using the
proposed loss function (8) with alternative losses defined by (19) and (22). In general,
the loss function (8) that enables closed-form posterior mean performs the best.

                           Perceptron weak learner         Naive Bayes weak learner
Dataset        # Examples  Loss (8)  Loss (19)  Loss (22)  Loss (8)  Loss (19)  Loss (22)
Heart          270         0.203     0.208      0.198      0.197     0.204      0.196
Breast-Cancer  683         0.065     0.070      0.068      0.045     0.050      0.046
Australian     693         0.183     0.207      0.200      0.191     0.209      0.203
Diabetes       768         0.301     0.307      0.300      0.285     0.287      0.284
German         1000        0.338     0.347      0.348      0.292     0.292      0.293
Splice         3175        0.390     0.418      0.418      0.144     0.150      0.150
Mushrooms      8124        0.028     0.032      0.031      0.025     0.047      0.046
Ionosphere     351         0.293     0.295      0.259      0.171     0.172      0.171
Sonar          208         0.385     0.391      0.380      0.301     0.302      0.303
SVMguide3      1284        0.265     0.278      0.276      0.222     0.226      0.225
7. Conclusion

We proposed a Bayesian approach for online estimation of the weights of a classifier
ensemble. This approach was based on an empirical risk minimization property of the posterior
distribution, and involved suitably choosing the likelihood function based on a user-defined 
choice of loss function. We developed the theoretical foundation, and identified the class of 
loss functions, for which the update sequence generated by our approach converged to the 
stationary risk minimizer. We demonstrated that, unlike standard SGD, the convergence 
guarantee was global and that the rate was optimal in a well-defined asymptotic sense. 
Moreover, experiments on real-world datasets demonstrated that our approach compared 
favorably to state-of-the-art SGD methods and online boosting methods. In future work, 
we will study further generalization of the scope of loss functions, and the extension of our 
framework to non-stationary environments. 

References 

S. Asmussen and P. W. Glynn. Stochastic simulation: Algorithms and analysis. Springer, 
2007. 

B. Babenko, M. H. Yang, and S. Belongie. Visual tracking with online multiple instance 
learning. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 
pages 983-990, 2009a. 

B. Babenko, M. H. Yang, and S. Belongie. A family of online boosting algorithms. In ICCV 
Workshops, pages 1346-1353, 2009b. 

Qinxun Bai, Henry Lam, and Stan Sclaroff. A bayesian framework for online classifier 
ensemble. In Proc. International Conf. on Machine Learning (ICML), 2014. 

N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and game 
theory. Machine Learning, pages 239-261, 2003. 

C. F. Chen. On asymptotic normality of limiting density functions with Bayesian
implications. Journal of the Royal Statistical Society, pages 540-546, 1985.

S. T. Chen. Personal communication, 2013.

S. T. Chen, H. T. Lin, and C. J. Lu. An online boosting algorithm with theoretical
justifications. In Proc. International Conf. on Machine Learning (ICML), pages 1007-1014,
2012.

R. Durrett. Probability Theory and Examples. Cambridge Series in Statistical and
Probabilistic Mathematics, 4th edition, 2010.

Daniel Fink. A compendium of conjugate priors. 1997.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. In Computational learning theory, pages 23-37, 1995.




J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of 
Statistics, pages 1189-1232, 2001. 

H. Grabner and H. Bischof. On-line boosting and vision. In Proc. IEEE Conf. on Computer 
Vision and Pattern Recognition (CVPR), pages 260-267, 2006. 

H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust
tracking. In Proc. European Conf. on Computer Vision (ECCV), pages 234-247, 2008.

M. Grbovic and S. Vucetic. Tracking concept change with incremental boosting by
minimization of the evolving exponential loss. In Machine Learning and Knowledge Discovery
in Databases, pages 516-532, 2011.

G. H. Hardy, J. E. Littlewood, and G. Polya. Inequalities. Cambridge University Press,
1952.

J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging:
a tutorial. Statistical Science, pages 382-401, 1999.

Jiaqiao Hu, Michael C Fu, and Steven I Marcus. A model reference adaptive search method 
for global optimization. Operations Research, 55(3):549-568, 2007. 

J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for 
drifting concepts. Journal of Machine Learning Research, pages 2755-2790, 2007. 

J. Z. Kolter and M. A. Maloof. Using additive expert ensembles to cope with concept drift.
In Proc. International Conf. on Machine Learning (ICML), pages 449-456, 2005.

H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and
applications. Springer, 2003.

C. Leistner, A. Saffari, P. M. Roth, and H. Bischof. On robustness of on-line boosting - a
competitive study. In ICCV Workshops, pages 1362-1369, 2009.

X. Liu and T. Yu. Gradient feature selection for online boosting. In Proc. IEEE
International Conf. on Computer Vision (ICCV), pages 1-8, 2007.

David JC MacKay. Information theory, inference and learning algorithms. Cambridge
University Press, 2003.

L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in
function space. In NIPS, 1999.

L. L. Minku. Online ensemble learning in the presence of concept drift. PhD thesis,
University of Birmingham, 2011.

N. C. Oza. Online ensemble learning. PhD thesis, University of California, Berkeley, 2001. 

N. C. Oza and S. Russell. Online bagging and boosting. In AISTATS, pages 105-112, 2001. 

R. Pasupathy and S. Kim. The stochastic root-finding problem: overview, solutions, and
open questions. ACM Trans. on Modeling and Computer Simulation, 21(3):19, 2011.




R. Pelossof, M. Jones, I. Vovsha, and C. Rudin. Online coordinate boosting. In ICCV 
Workshops, pages 1354-1361, 2009. 

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by
averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.

Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach
to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer,
2004.

A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class LPBoost. In
Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3570-3577,
2010.

Robert E Schapire. Drifting games. Machine Learning, pages 265-291, 2001.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the
stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

R. J. Serfling. Approximation theorems of mathematical statistics. Wiley, 2009.

M. Telgarsky. A primal-dual convergence analysis of boosting. Journal of Machine Learning
Research, pages 561-606, 2012.

H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble
classifiers. In Proc. ACM SIGKDD Conf. on Knowledge Discovery and Data Mining
(KDD), pages 226-235, 2003.

Wei Xu. Towards optimal one pass large scale learning with averaged stochastic gradient
descent. arXiv preprint arXiv:1107.2490, 2011.

Mark Zlochin, Mauro Birattari, Nicolas Meuleau, and Marco Dorigo. Model-based search
for combinatorial optimization: A critical survey. Annals of Operations Research,
131(1-4):373-395, 2004.




